actor_name

According to the wiki page, we can get rid of these columns:

- name_type
- name_number

| | pk_actor_name | concat_acna | is_standard_name | lang_iso | name | first_name | ordinal_text | ordinal_num | particle | title | ... | creator | creation_time | modifier | modification_time | concat_name | fk_abob_name_type | begin_month | begin_day | end_month | end_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10051 | 10359 | AcNa10359 | True | | Willaume | Fernand | None | NaN | None | None | ... | 11.0 | 2008-07-18 18:43:44.000 | 3.0 | 2013-02-14 11:00:03 | Willaume, Fernand | NaN | NaN | NaN | NaN | NaN |
| 51293 | 52157 | AcNa52157 | True | ita | Saletta | Giovanni Alessandro | None | NaN | None | None | ... | 30.0 | 2014-03-29 15:20:27.990 | 30.0 | 2014-03-29 16:09:32 | Saletta, Giovanni Alessandro - da Chiari | 1058.0 | NaN | NaN | NaN | NaN |
| 56773 | 57724 | AcNa57724 | True | deu | Lange | Christian | None | NaN | None | None | ... | 3.0 | 2014-09-11 22:47:27.880 | NaN | NaT | Lange, Christian | 1058.0 | NaN | NaN | NaN | NaN |
| 1813 | 2088 | AcNa2088 | True | | Castellano | Fernando | None | NaN | None | None | ... | 27.0 | 2008-11-09 09:24:28.000 | 3.0 | 2013-02-14 11:00:03 | Castellano, Fernando | NaN | NaN | NaN | NaN | NaN |
| 11487 | 11804 | AcNa11804 | True | | Bailhache | None | None | NaN | None | None | ... | 2.0 | 2008-07-19 00:14:18.000 | 3.0 | 2013-02-14 11:00:03 | Bailhache | NaN | NaN | NaN | NaN | NaN |

5 rows × 28 columns
Columns contain:

Total number of rows: 67293

- "pk_actor_name": 0.00% empty - 67293 (100.00%) uniques (eg: 49829; 49830; 49832)
- "concat_acna": 0.00% empty - 67293 (100.00%) uniques (eg: AcNa49829; AcNa49830; AcNa49832)
- "is_standard_name": 0.00% empty - 2 ( 0.00%) uniques (eg: True; False)
- "concat_name": 0.00% empty - 63642 ( 94.57%) uniques (eg: Otte, Bern...; Staud, Joh...; Roma, Giul...)
- "creation_time": 0.00% empty - 40469 ( 60.14%) uniques (eg: 2013-02-20...; 2013-02-20...; 2013-02-20...)
- "fk_actor": 0.00% empty - 61555 ( 91.47%) uniques (eg: 46706; 46707; 46709)
- "creator": 0.00% empty - 89 ( 0.13%) uniques (eg: 48.0; 3.0; 41.0)
- "name": 3.55% empty - 32301 ( 48.00%) uniques (eg: Otte; Staud; Roma)
- "lang_iso": 4.20% empty - 27 ( 0.04%) uniques (eg: None; ita; )
- "modifier": 7.31% empty - 88 ( 0.13%) uniques (eg: 48.0; 3.0; 116.0)
- "first_name": 7.88% empty - 12315 ( 18.30%) uniques (eg: Bernhard; Johann; Giulio)
- "modification_time": 24.75% empty - 4689 ( 6.97%) uniques (eg: NaT; 2013-02-14...; 2013-02-20...)
- "fk_abob_name_type": 70.46% empty - 8 ( 0.01%) uniques (eg: nan; 1058.0; 1060.0)
- "notes": 86.47% empty - 420 ( 0.62%) uniques (eg: None; ; Se fait ap...)
- "comment_begin_year": 87.64% empty - 25 ( 0.04%) uniques (eg: None; ; En septemb...)
- "comment_end_year": 87.68% empty - 12 ( 0.02%) uniques (eg: None; ; Nom parfoi...)
- "apposition": 95.00% empty - 1892 ( 2.81%) uniques (eg: None; Acquanegra; Loyola)
- "preposition": 95.52% empty - 37 ( 0.05%) uniques (eg: None; dit de; de)
- "particle": 95.63% empty - 115 ( 0.17%) uniques (eg: None; d'; van)
- "title": 98.37% empty - 229 ( 0.34%) uniques (eg: None; d'; de)
- "begin_year": 98.74% empty - 279 ( 0.41%) uniques (eg: 1883.0; 1882.0; nan)
- "end_year": 99.49% empty - 210 ( 0.31%) uniques (eg: 1933.0; 1939.0; nan)
- "ordinal_text": 99.70% empty - 28 ( 0.04%) uniques (eg: None; VIII; III)
- "ordinal_num": 99.90% empty - 9 ( 0.01%) uniques (eg: nan; 8.0; 1.0)
- "begin_month": 99.98% empty - 9 ( 0.01%) uniques (eg: nan; 9.0; 3.0)
- "begin_day": 99.99% empty - 9 ( 0.01%) uniques (eg: nan; 17.0; 7.0)
- "end_month": 99.99% empty - 5 ( 0.01%) uniques (eg: nan; 2.0; 7.0)
- "end_day": 99.99% empty - 5 ( 0.01%) uniques (eg: nan; 15.0; 16.0)

Based on the summary above, we parse each column into its most meaningful type.
Here we report the interesting findings from individual columns. The analysis is not exhaustive.
For some of the columns, we also update the values.
We create two new columns: one joins begin_year, begin_month and begin_day, the other joins end_year, end_month and end_day.
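
A minimal sketch of how such a join could work, not necessarily the code used in this notebook. The helper names `_as_int` and `join_date_parts` are hypothetical, and the output format (`YYYY`, `YYYY-MM` or `YYYY-MM-DD`, depending on which parts are present) is an assumption. The parts are stored as floats with NaN for missing values, so they are converted to integers first.

```python
import math
from typing import Optional


def _as_int(value) -> Optional[int]:
    """Return value as an int, or None when it is missing (None or NaN)."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return int(value)


def join_date_parts(year, month, day) -> Optional[str]:
    """Join year, month and day into 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'.

    Assumption: the join stops at the first missing part, so a day
    without a month is ignored, and a missing year yields None.
    """
    y, m, d = _as_int(year), _as_int(month), _as_int(day)
    if y is None:
        return None
    text = f"{y:04d}"
    if m is None:
        return text
    text += f"-{m:02d}"
    if d is None:
        return text
    return text + f"-{d:02d}"
```

On a DataFrame, the function could be applied row by row, for example `df.apply(lambda r: join_date_parts(r.begin_year, r.begin_month, r.begin_day), axis=1)`, and the result stored in a new column whose name is left to the reader.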
Some cleaning is applied to this column so that its values conform to ISO 639-2/T (three-letter codes, native form preferred, e.g. 'deu' instead of 'ger').
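
As an illustration of this kind of cleaning, the sketch below maps the ISO 639-2 bibliographic (B) codes that differ from their terminology (T) counterparts onto the T form. The names `B_TO_T` and `normalize_lang_code` are hypothetical, and treating blank strings as missing is an assumption. The actual cleaning rules applied in the notebook may differ.

```python
import math
from typing import Optional

# ISO 639-2 codes whose bibliographic (B) form differs from the
# terminology (T) form; every other code is identical in both sets.
B_TO_T = {
    "alb": "sqi", "arm": "hye", "baq": "eus", "bur": "mya", "chi": "zho",
    "cze": "ces", "dut": "nld", "fre": "fra", "geo": "kat", "ger": "deu",
    "gre": "ell", "ice": "isl", "mac": "mkd", "mao": "mri", "may": "msa",
    "per": "fas", "rum": "ron", "slo": "slk", "tib": "bod", "wel": "cym",
}


def normalize_lang_code(value) -> Optional[str]:
    """Return the ISO 639-2/T form of a language code, or None if empty.

    Assumption: None, NaN and blank strings all count as missing.
    Codes are not validated; unknown values are returned lowercased.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    code = str(value).strip().lower()
    if not code:
        return None
    return B_TO_T.get(code, code)
```

The function works on single values, so it can be mapped over a column, for example with `Series.map`.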
All HTML tags, non-ASCII characters and newlines are removed.
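
One possible way to do this with regular expressions is sketched below. The function name `clean_text` is hypothetical, and two choices are assumptions rather than the notebook's documented behaviour: tags and newlines are replaced by a single space (so that adjacent words do not merge), and non-string values are returned unchanged. The tag pattern is a simple heuristic, not a full HTML parser.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")           # anything that looks like <...>
_NON_ASCII_RE = re.compile(r"[^\x00-\x7F]")  # any character outside ASCII
_WHITESPACE_RE = re.compile(r"\s+")         # runs of spaces, tabs, newlines


def clean_text(value):
    """Strip HTML tags, non-ASCII characters and newlines from a string.

    Non-string values (None, NaN, numbers) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    text = _TAG_RE.sub(" ", value)         # tags become a space
    text = _NON_ASCII_RE.sub("", text)     # non-ASCII characters are dropped
    text = _WHITESPACE_RE.sub(" ", text)   # newlines collapse into one space
    return text.strip()
```

Note that dropping non-ASCII characters is lossy for accented text: 'appelé' becomes 'appel'.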